Dealing with large data sets can be confusing. If you are working in spreadsheets, the confusion can reach the point of an existential crisis bordering on pure chaos. Good visualization tools can help. Visualization gives you an overview of your data, and it helps you report patterns and differences in your data.
Needless to say, any aims, objectives and hypotheses should be determined before any data is collected. Data visualization is a good time to get a clear sense of how your data looks, but it is not the time to start making up hypotheses about it.
Here we demonstrate a few different approaches for data visualization. We do this for several types of high dimensional data using plotting functions from tidyverse libraries including ggplot2, plyr and dplyr among others in the R programming language (Wickham 2016, 2011; Wickham et al. 2019; R Core Team 2019).
Plots of high dimensional data do not always need an x-axis to be easy to read. In this case we sometimes compress it to a point using polar coordinates. To show some options for radial bar plots, we create an example data set with a factor variable using the data.frame and sample functions in base R.
DF <- data.frame(variable = as.factor(1:10),
                 value = sample(10, replace = TRUE))
We also create a function to compute the standard error of the mean, to represent some of the uncertainty in the data. It uses the sqrt and length functions in base R and the var function from the stats library.
se <- function(x) sqrt(var(x)/length(x))
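As a small worked check, the function above gives the same result as the more familiar form sd(x) / sqrt(n). The vector x_example is made up for illustration and is not part of the data sets used here.

```r
# Standard error of the mean, as defined above
se <- function(x) sqrt(var(x) / length(x))

# Hypothetical example values (an assumption, for illustration only)
x_example <- c(2, 4, 4, 6)

# var(x_example) is 8/3, so the standard error is sqrt((8/3) / 4)
se(x_example)

# The same quantity written with sd()
sd(x_example) / sqrt(length(x_example))
```

Both calls return about 0.816.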
We use the same data to create a radial bar plot using the functions above and the ggplot2 library.
library(ggplot2)
ggplot(DF, aes(variable, value, fill = variable)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  geom_errorbar(aes(ymin = value - se(DF$value),
                    ymax = value + se(DF$value),
                    color = variable),
                width = .2) +
  scale_y_continuous(breaks = 0:nlevels(DF$variable)) +
  theme_minimal() +
  coord_polar()
Create a data set for radial plots with three factor variables.
DF2 <- data.frame(name = rep(letters[1:3], length.out = 30),
                  variable = as.factor(1:5),
                  factor_variable = rep(letters[4:7], length.out = 30),
                  value = sample(10, replace = TRUE))
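The columns passed to data.frame here have different lengths (30, 5, 30 and 10), so the shorter ones are recycled to fill 30 rows. A minimal sketch of how this could be checked follows. The set.seed() call is an assumption added only to make the sketch reproducible.

```r
set.seed(1)  # assumption: seed added only for reproducibility

DF2 <- data.frame(name = rep(letters[1:3], length.out = 30),
                  variable = as.factor(1:5),
                  factor_variable = rep(letters[4:7], length.out = 30),
                  value = sample(10, replace = TRUE))

# 30 rows and 4 columns: the short columns were recycled
dim(DF2)

# The 10 sampled values repeat down the column
identical(DF2$value[1:10], DF2$value[11:20])

# Number of rows behind each combination of name and factor_variable
table(DF2$name, DF2$factor_variable)
```

Because the values repeat, the bars in each facet are not independent draws. For an illustration of the plotting options this is fine, but it is worth knowing when reading the error bars.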
Plot radial plots with three factor variables.
multi_plot <- ggplot(DF2, aes(variable, value, fill = variable)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  geom_errorbar(aes(ymin = value - se(DF2$value),
                    ymax = value + se(DF2$value),
                    color = variable),
                width = .2) +
  scale_y_continuous(breaks = 0:nlevels(DF2$variable)) +
  theme_minimal() +
  coord_polar()
Plot with rows as name and columns as factor_variable.
# Rows are name and columns are factor_variable
multi_plot + facet_grid(name ~ factor_variable)
Plot with bars going around the circle.
# Rows are name and columns are factor_variable
multi_plot +
  coord_polar(theta = "y") +
  facet_grid(name ~ factor_variable)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
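The message above appears because multi_plot already carries coord_polar(), and the second coordinate system replaces it. One way to avoid the message is to keep the layers in an object without any coordinate system and add the coordinates once per plot. In this sketch, base_plot, radial_x and radial_y are hypothetical names, the error bars are left out for brevity, and set.seed() is an assumption added for reproducibility.

```r
library(ggplot2)

set.seed(1)  # assumption: seed added only for reproducibility
DF2 <- data.frame(name = rep(letters[1:3], length.out = 30),
                  variable = as.factor(1:5),
                  factor_variable = rep(letters[4:7], length.out = 30),
                  value = sample(10, replace = TRUE))

# Layers only, no coordinate system yet
base_plot <- ggplot(DF2, aes(variable, value, fill = variable)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  theme_minimal()

# Bars radiating from the centre (angle from x)
radial_x <- base_plot +
  coord_polar() +
  facet_grid(name ~ factor_variable)

# Bars going around the circle (angle from y)
radial_y <- base_plot +
  coord_polar(theta = "y") +
  facet_grid(name ~ factor_variable)
```

Printing radial_x or radial_y draws the plot. Neither should trigger the replacement message, since each has only one coordinate system.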
More on making polar barplots from this blog.
To show a radial box plot, we create a data set with four factor variables and one continuous variable.
DF3 <- data.frame(name = rep(letters[1:3], length.out = 600),
                  variable = as.factor(sample(5, replace = TRUE)),
                  factor_variable = rep(letters[4:7], length.out = 600),
                  variable3 = rep(letters[8:16], length.out = 600),
                  value = sample(50, replace = TRUE))
Plot the radial box plot with ggplot2 functions geom_boxplot() and coord_polar().
multi_plot <- ggplot(data = DF3, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  scale_y_continuous(breaks = 0:nlevels(DF3$variable)) +
  theme_minimal() +
  coord_polar()
# call the plot
multi_plot
Radial box plot with rows as name and columns as factor_variable.
multi_plot + facet_grid(name ~ factor_variable)
ToothGrowth data
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
DF4 <- ToothGrowth
head(DF4)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
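Before faceting the box plots, it can help to confirm how many observations sit behind each box. A short sketch with base R's table(), working on a copy of the built-in ToothGrowth data:

```r
# Copy the built-in data and treat dose as a factor
DF4 <- ToothGrowth
DF4$dose <- as.factor(DF4$dose)

# Observations per supplement type (rows) and dose (columns)
table(DF4$supp, DF4$dose)
```

Each of the six combinations of supp and dose holds 10 observations, so every box in the faceted plots is based on the same number of data points.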
box_plot <- ggplot(DF4, aes(x = dose, y = len, group = dose)) +
  geom_boxplot(aes(fill = dose)) +
  theme_minimal() +
  coord_polar()
Split the radial boxplot vertically
box_plot + facet_grid(supp ~ .)
Split the radial boxplot horizontally
box_plot + facet_grid(. ~ supp)
To demonstrate a sunburst-style barplot confined to a circle, we create a small data set using data.frame.
Here is a thread about some more helpful options and scripts for making sunbursts and donut plots.
DF5 <- data.frame(
  'level1' = c('a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'),
  'level2' = c('a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'c1', 'c2', 'c3'),
  'value' = c(.025, .05, .027, .005, .012, .014, .1, .03, .18))
Create a sunburst-style barplot confined to a circle
ggplot(DF5, aes(y = value)) +
  geom_bar(aes(fill = level1, x = 0), width = .5, stat = 'identity') +
  geom_bar(aes(fill = level2, x = .25), width = .25, stat = 'identity') +
  coord_polar(theta = 'y') +
  theme_minimal()
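With theta = 'y', the stacked values are mapped to the angle, so the arc of each segment should be roughly proportional to its share of the total. The sketch below computes those shares for the inner ring with aggregate() from base R. The object name inner is hypothetical.

```r
DF5 <- data.frame(
  level1 = c('a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'),
  level2 = c('a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'c1', 'c2', 'c3'),
  value = c(.025, .05, .027, .005, .012, .014, .1, .03, .18))

# Total value per inner-ring category (level1)
inner <- aggregate(value ~ level1, data = DF5, FUN = sum)

# Share of the full circle taken by each inner-ring category
inner$share <- inner$value / sum(DF5$value)
inner
```

The totals are 0.107 for a, 0.026 for b and 0.31 for c, so category c should take up about 70% of the inner ring.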
To demonstrate the spider plot, we create the coord_radar() function (see footnote 1) to obtain straight lines. It uses match.arg() from base R to check its input.
coord_radar <- function(theta = 'x', start = 0, direction = 1) {
  # input parameter sanity check
  match.arg(theta, c('x', 'y'))
  ggproto(
    NULL, CoordPolar,
    theta = theta, r = ifelse(theta == 'x', 'y', 'x'),
    start = start, direction = sign(direction),
    is_linear = function() TRUE)
}
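The only behavioural difference from coord_polar() is the is_linear() method. As far as we understand ggplot2's internals, paths in a non-linear coordinate system are interpolated and so bend along the circle, while a coordinate system that reports itself as linear leaves the segments between points straight. A small sketch of that difference, with coord_radar() repeated so that the block runs on its own:

```r
library(ggplot2)

# Same idea as coord_radar() above, repeated so this block is self-contained
coord_radar <- function(theta = 'x', start = 0, direction = 1) {
  theta <- match.arg(theta, c('x', 'y'))
  ggproto(NULL, CoordPolar,
          theta = theta, r = if (theta == 'x') 'y' else 'x',
          start = start, direction = sign(direction),
          is_linear = function() TRUE)
}

# The standard polar coordinate system reports itself as non-linear
coord_polar()$is_linear()

# The radar variant reports itself as linear
coord_radar()$is_linear()
```

The first call should return FALSE and the second TRUE.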
Create a factor, variable, and value to be plotted in the spider plot using base R functions.
factor <- c(rep("A", 16), rep("B", 16))
variable <- as.factor(c(1:16))
value <- sample(c(1:10), 32, replace = TRUE)
In order to neatly close the plot we add an empty level to the data set (a quasi-blank variable) which needs the same value as level 1. For this to work both factors (“A” and “B” in our case) need this correction.
value[16] <- value[1]
value[32] <- value[17]
We add the factor, variable, and value together with the blank variable to a data set using data.frame.
DF6 <- data.frame(factor = factor, variable = variable, value = value)
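To confirm that both lines will close, we can check that the value at level 16 matches the value at level 1 within each factor. This is a sketch of one way to do that check. The set.seed() call and the object name closed are assumptions added for illustration.

```r
set.seed(42)  # assumption: seed added only for reproducibility

factor <- c(rep("A", 16), rep("B", 16))
variable <- as.factor(c(1:16))
value <- sample(c(1:10), 32, replace = TRUE)

# Level 16 takes the value of level 1 in each group
value[16] <- value[1]
value[32] <- value[17]

DF6 <- data.frame(factor = factor, variable = variable, value = value)

# For each group, compare the value at level 1 with the value at level 16
closed <- sapply(split(DF6, DF6$factor), function(d) {
  d$value[d$variable == "1"] == d$value[d$variable == "16"]
})
closed
```

Both entries of closed should be TRUE. Note that variable has only 16 elements, so data.frame recycles it across the 32 rows.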
Plot with the ggplot2 library.
ggplot(DF6, aes(as.numeric(variable), value, colour = factor)) +
  coord_radar() +
  geom_path(size = 1.5) + scale_x_continuous(breaks = c(1:15)) +
  labs(x = "variable") +
  theme_minimal()
Here we use the OrchardSprays data to run the example from the tidyverse Violin plot examples (Wickham 2017).
ggplot(OrchardSprays, aes(y = decrease, x = treatment, fill = treatment)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  theme(legend.position = "none")
Heatmaps are another way of displaying multi-dimensional data in a single figure. We use the synthesized data from ethnobotanyR for this heatmap example (Whitney 2019).
#create synthesized use data
eb_data <- data.frame(replicate(10,sample(rnorm(200, mean=1.5, sd=0.5))))
names(eb_data) <- gsub(x = names(eb_data), pattern = "X", replacement = "Use_")
eb_data$informant <- sample(c('User_1', 'User_2', 'User_3'), 200, replace=TRUE)
eb_data$sp_name <- sample(c('s1', 's2', 's3', 's4'), 200, replace=TRUE)
eb_data$year <- sample(c('2018', '2019'), 200, replace=TRUE)
We use the reshape library (Wickham 2007) to melt the data and the geom_tile() function from ggplot2 to plot the resulting heatmap.
#reshape data for the plot
ethno_melt <- reshape::melt(eb_data, id=c("informant","year", "sp_name"))
ggplot(ethno_melt, aes(y = factor(year), x = factor(sp_name))) +
  geom_tile(aes(fill = value)) + # heatmap
  scale_fill_continuous(low = "blue", high = "green") + # fill gradient for value
  facet_grid(informant ~ variable) + # grid by factor
  labs(fill = 'use') + # legend title
  theme_minimal() +
  xlab("sp_name") + ylab("")
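Melting turns the wide table, with one column per use category, into a long table with one row per combination of id and use category. The sketch below reproduces that idea in base R on a tiny made-up table, without the reshape library. The objects wide, use_cols and long are hypothetical, and this is not how melt is implemented internally.

```r
# Hypothetical wide table: one row per informant, one column per use
wide <- data.frame(informant = c("User_1", "User_2"),
                   Use_1 = c(1.2, 1.7),
                   Use_2 = c(0.9, 2.1))

use_cols <- c("Use_1", "Use_2")

# Long table: one row per informant and use category
long <- data.frame(
  informant = rep(wide$informant, times = length(use_cols)),
  variable = factor(rep(use_cols, each = nrow(wide)), levels = use_cols),
  value = unlist(wide[use_cols], use.names = FALSE)
)
long
```

The long table has four rows. Its variable column is what facet_grid() uses for the columns of the heatmap, and value is what geom_tile() maps to the fill color.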
Here we use a combination of a bubble graph and a heatmap to show several continuous variables in the same figure. We start by synthesizing data and conditions for bubble sizes and fill.
# set heat bubble parameters
heat_sq <- sample(c(rnorm(10, 5, 1)), 150, replace = TRUE)
circlefill <- heat_sq * 10 + rnorm(length(heat_sq), 0, 3)
circlesize <- heat_sq * 1.5 + rnorm(length(heat_sq), 0, 3)
# synthesize heat bubble data
D7 <- data.frame(rowv = rep(1:10, 15), columnv = rep(1:15, each = 10),
                 heat_sq, circlesize, circlefill)
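The two rep() calls lay the 150 rows out as a complete 10 by 15 grid, so each tile and bubble corresponds to exactly one row. A brief sketch of how that could be verified, using a reduced version of the data set with only the heat_sq column. The set.seed() call and the object name cells are assumptions.

```r
set.seed(1)  # assumption: seed added only for reproducibility

heat_sq <- sample(rnorm(10, 5, 1), 150, replace = TRUE)
D7 <- data.frame(rowv = rep(1:10, 15),
                 columnv = rep(1:15, each = 10),
                 heat_sq)

# Count the rows falling into each grid cell
cells <- table(D7$rowv, D7$columnv)

dim(cells)        # 10 rows by 15 columns
all(cells == 1)   # TRUE: every cell occurs exactly once
```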
As above, we use geom_tile() from the ggplot2 library to plot this as a heatmap. In addition, we use geom_point() to put bubbles on the heatmap, which shows two more continuous variables through bubble size and color.
ggplot(D7, aes(y = factor(rowv), x = factor(columnv))) +
  geom_tile(aes(fill = heat_sq)) +
  scale_fill_continuous(low = "blue", high = "red") +
  geom_point(aes(colour = circlefill, size = circlesize)) +
  scale_color_gradient(low = "green", high = "yellow") +
  scale_size(range = c(1, 20)) +
  theme_bw()
1. The coord_radar() function was taken from the question “Closing the lines in a ggplot2 radar / spider chart” on the Stack Overflow website. https://stackoverflow.com/questions/28898143/closing-the-lines-in-a-ggplot2-radar-spider-chart
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Whitney, Cory. 2019. EthnobotanyR: Calculate Quantitative Ethnobotany Indices. https://github.com/CWWhitney/ethnobotanyR.
Wickham, Hadley. 2007. “Reshaping Data with the Reshape Package.” Journal of Statistical Software 21 (12). http://www.jstatsoft.org/v21/i12/paper.
———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wilke, Claus O. 2018. Ggridges: Ridgeline Plots in ’Ggplot2’. https://CRAN.R-project.org/package=ggridges.